Abstract

Several technological applications require the translation of a protein into a
nucleic acid that codes for it ("backtranslation"). The degeneracy of the genetic
code makes this translation ambiguous; moreover, not every translation is equally
viable. The common answer to this problem is the imitation of the codon usage of
the target species. Here we discuss several other features of coding sequences
("coding statistics") that are relevant for the "genomic style" of different
species. A genetic algorithm is then used to obtain backtranslations that mimic
these styles, by minimizing the difference in the coding statistics. Possible
improvements and applications are discussed.
1 Introduction



The main components of the cell are nucleic acids (DNA and RNA) and proteins.
Both are polymers, long words written in alphabets of 4 and 20 letters:
4 nucleotides for DNA and RNA, and 20 amino acids for proteins. The "fundamental
dogma" of molecular biology describes the usual flow of information in the cell,
from DNA to mRNA to protein. The first step, transcription, preserves the
sequence read from DNA, which is reversed and complemented in the mRNA (in
addition, the alphabet is slightly changed). It is straightforward to obtain the
DNA from a given mRNA (it is then called complementary DNA, or cDNA); in fact,
Nature does it: retrotranscription is performed by viruses and several small
"selfish" units of information.
The second step, translation, is more complicated: the mRNA is read, three
nucleotides at a time, and an amino acid encoded by them is added to the
forming protein, according to the well known genetic code (see Table 1). This
nearly universal code associates to each triplet (codon) an amino acid, or the
"stop" meaning.



Table 1: The (standard) Genetic Code 
Unlike retrotranscription, the reversal of this second step (called
backtranslation) is ambiguous, due to the degeneracy of the genetic code: as can
be seen in Table 1, amino acids are encoded by 1, 2, 3, 4 or 6 different codons.
Backtranslation does not occur in natural systems^1, but is required for several
purposes in genomics and biotechnology. The problem is not trivial, since
different species have different "genomic styles" that determine which of the
many preimages is used to code for a protein. Thus it may happen that we
know the DNA for a given protein produced by, for instance, a plant, but we
want to synthesize the protein in a bacterium [31]. We will then need to
backtranslate the protein into the genomic style of this kind of bacteria. In
other cases, the protein is known but no DNA is known for it at all; this may
happen with artificial proteins, or with proteins from unsequenced organisms.
Other applications, like degenerate primers (for "gene fishing") and sequence
analysis, will be discussed in the last section.
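To give a feeling for the size of the ambiguity, the following minimal sketch counts the coding sequences that translate to a given protein under the standard genetic code. The function names (`degeneracy`, `count_backtranslations`) are illustrative choices, not part of the original work.

```python
from itertools import product
from math import prod

# Standard genetic code, codons enumerated in t, c, a, g order ("*" marks stop).
BASES = "tcag"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def degeneracy(aa):
    """Number of codons that encode the amino acid aa."""
    return sum(1 for a in CODE.values() if a == aa)

def count_backtranslations(protein):
    """Number of distinct coding sequences whose translation is protein."""
    return prod(degeneracy(aa) for aa in protein)
```

For example, `count_backtranslations("LSR")` is 6 * 6 * 6 = 216, so even a protein of moderate length has an astronomically large set of preimages.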

The best known statistical feature of coding sequences is the presence of a 
periodicity of period 3, which is caused by the structure of the genetic code 
and the asymmetry of the different codon positions [14, 21]. This property is 
very important for distinguishing coding from non-coding sequences; however, 
it is not important for backtranslation, since it is shared by all organisms. On 
the other hand, we know that codon usage (the degree of preference for the 
different codons inside each synonymous class) does distinguish one species 
from another; it is the best known feature of the different "genomic styles".

The common approach to backtranslation relies on the imitation of the codon
usage of the target species (the species whose style we want to imitate) [28].
This is the solution currently given by all commercial and non-commercial
software, like GCG, EMBOSS, VectorNTI, EditSeq, AiO, and the online tools
of Molecular Toolkit and Entelechon. The only different approach we know is
[36], where a neural network was trained to perform backtranslation. However,
it was done at the single amino acid level, and thus it cannot account for
anything but codon usage.

^1 Though [27] suggests that it did occur at the origin of life, and even
proposes an in vitro device for backtranslation.

This current solution can be improved; there are more features peculiar to
the different coding styles [11,18], which are in part or completely independent
from codon usage [10]. In the present article, we consider different possible
statistics that may be associated to genomic styles, and then we apply a
genetic algorithm to perform backtranslation, taking these features into account.
Our approach considers DNA only as a symbolic sequence, ignoring chemical
properties or biological features. Furthermore, we will not use biological
considerations to decide whether or not a statistical property needs to be
imitated: we assume that any property distinguishing the style of a species must
be considered in backtranslation (after all, in some cases the origin of known
features remains obscure). All the statistics we consider were taken from the
literature on sequence analysis, where their possible interpretations are
discussed.



2 Notation, Materials 



Let $\mathcal{A}$ = {A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y}
and $\mathcal{B}$ = {a, c, g, t} be the alphabets for amino acids and nucleotides,
respectively, and denote $\mathcal{B}^{3*} = (\mathcal{B}^3)^*$. Let
$\tau : \mathcal{B}^{3*} \to (\mathcal{A} \cup \{\mathrm{stop}\})^*$ be the translation
of a sequence according to the genetic code. In fact, $\tau$ may depend on small
variations to the code which do occur in some species and organelles; however,
here we will assume the code to be universal. Furthermore, we will consider
the sequences without the start and stop signals, i.e., cutting the atg codon
that initiates a protein and the stop codon that marks its end.

We will say that a function (or stochastic procedure)
$\beta : \mathcal{A}^* \to \mathcal{B}^{3*}$ is a guess iff
$\tau \circ \beta = \mathrm{id}_{\mathcal{A}^*}$. If $C \subseteq \mathcal{A}^*$, we will
denote $\beta(C) = \{\beta(u) : u \in C\}$. A particular guess that will be used for
comparison purposes is the canonical backtranslation procedure, which
backtranslates each amino acid using the empirical frequencies of its codons as
probabilities; we will denote it as $\beta^{cu}_{sp}$,
with the subindex indicating the species whose codon usage table was used.
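The canonical procedure can be sketched as follows. This is a minimal illustration, not the implementation used in the paper: `codon_freq` is a hypothetical dictionary mapping codons to their empirical frequencies in the chosen species, and the uniform fallback for amino acids with no recorded codon is an assumption.

```python
import random
from itertools import product

# Standard genetic code, codons enumerated in t, c, a, g order ("*" marks stop).
BASES = "tcag"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def synonymous_codons(aa):
    """All codons that encode the amino acid aa."""
    return [c for c, a in CODE.items() if a == aa]

def canonical_backtranslation(protein, codon_freq, rng=random):
    """Draw one codon per amino acid, with probability proportional to codon_freq."""
    out = []
    for aa in protein:
        codons = synonymous_codons(aa)
        weights = [codon_freq.get(c, 0.0) for c in codons]
        if sum(weights) == 0:
            weights = [1.0] * len(codons)  # assumption: uniform choice if no data
        out.append(rng.choices(codons, weights=weights, k=1)[0])
    return "".join(out)

def translate(dna):
    """Translation of a coding sequence (length multiple of 3), codon by codon."""
    return "".join(CODE[dna[i:i + 3]] for i in range(0, len(dna), 3))
```

By construction `translate(canonical_backtranslation(u, codon_freq)) == u` for any protein `u`, which is exactly the condition for being a guess.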

Given a sequence $w \in \mathcal{B}^{3*}$, $w = w_0, w_1, \ldots$ and $i = 1, 2, 3$, we
will talk about the letters in codon position $i$ to refer to
$w_{i-1}, w_{i+2}, w_{i+5}, \ldots$ We will denote with $\pi_{ry}$, $\pi_{ws}$ and
$\pi_{mk}$ the three most usual projections of $\mathcal{B}$ into {0, 1}, as
follows. We will use the same symbols to refer to the extensions of these
functions to $\mathcal{B}^{3*}$ (projecting each letter).

              a  c  g  t    refers to:
  $\pi_{ry}$  0  1  0  1    purine / pyrimidine
  $\pi_{ws}$  0  1  1  0    weak / strong
  $\pi_{mk}$  0  0  1  1    amino / keto



It is important to notice that many characters in $\beta(u)$ are almost or
completely determined by $u$. Amino acid K, for instance, is coded by aaa and aag;
the first and the second position will be a in any backtranslation, and the third
one will be either a or g (and will have $\pi_{ry} = 0$, so that for any $\beta$,
$\pi_{ry}(\beta(K)) = 000$). The next table shows the number of amino acids for which
characters are fixed in the different codon positions for the different binary
alphabets. Most of the ambiguity of backtranslation is in the third position.





Table 2: Number of amino acids with fixed characters, by codon position

                $\pi_{ry}$   $\pi_{ws}$   $\pi_{mk}$
  Cod. Pos. 1       18           18           18
  Cod. Pos. 2       19           20           19
  Cod. Pos. 3       11            2            2
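The counts in the table above can be recomputed directly from the standard genetic code. The sketch below is illustrative: the dictionary `PROJ` and the function `fixed_counts` are names chosen here, and the assignment of 0 and 1 within each projection is an assumption that does not affect the counts.

```python
from itertools import product

# Standard genetic code, codons enumerated in t, c, a, g order ("*" marks stop).
BASES = "tcag"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# Binary projections of the nucleotide alphabet.
PROJ = {
    "ry": {"a": 0, "g": 0, "c": 1, "t": 1},  # purine / pyrimidine
    "ws": {"a": 0, "t": 0, "c": 1, "g": 1},  # weak / strong
    "mk": {"a": 0, "c": 0, "g": 1, "t": 1},  # amino / keto
}

def fixed_counts():
    """For each projection and codon position (1..3), count the amino acids
    whose projected character is the same in all of their codons."""
    table = {}
    for name, proj in PROJ.items():
        for pos in range(3):
            n = 0
            for aa in set(AMINO) - {"*"}:
                values = {proj[c[pos]] for c, a in CODE.items() if a == aa}
                n += len(values) == 1
            table[(name, pos + 1)] = n
    return table
```

Calling `fixed_counts()` gives 18, 19, 11 for `ry`, 18, 20, 2 for `ws` and 18, 19, 2 for `mk` in codon positions 1, 2 and 3, in agreement with the table.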



Materials 

We extracted coding sequences from GenBank [3] release 131 (August 2002),
belonging to the following species: Methanosarcina acetivorans C2A (A1),
Sulfolobus solfataricus (A2), Escherichia coli (B1), Bacillus subtilis (B2),
Streptomyces coelicolor A3(2) (B3), Mesorhizobium loti (B4), Nostoc sp. PCC 7120
(B5), Saccharomyces cerevisiae (E1), Arabidopsis thaliana (E2), Drosophila
melanogaster (E3), Caenorhabditis elegans (E4) and Homo sapiens (E5). The
selection of species was done trying to have abundant sequences and a rather
good representation of the tree of life. All coding sequences ("CDS" features
in GenBank) were extracted, provided that they were complete, unambiguous, and
longer than 1029 nucleotides. The average length of the sequences varies
between 1500 for A1 and 2456 for E3. Please notice that introns (intervening
sequences) were removed from the sequences; this may affect the coding
statistics that depend on relations between distant nucleotides. We will use the
abbreviation of a species to refer to the set of its coding sequences, or to the
set of the corresponding proteins, depending on the context. Thus, an expression
like $\beta^{cu}_{B1}(E5)$ denotes a set of backtranslations obtained for all
proteins encoded by the coding sequences of E5, obtained by the standard
backtranslation method, considering the codon usage of B1.
3 Coding Statistics



Here we discuss the results of computations performed on our set of species for 
several features that have been studied in coding sequences, "generally known 
as coding statistics, since their behavior is statistically distinct on coding and 
non-coding regions" [10]. Discussions about the most common coding statistics, 
their relations, and their use for gene finding, can be found in [11] and [18]. 
However, we are not interested in the difference between coding and non- 
coding regions; rather, we want those statistics that contribute to the "genomic 
style" of a species. 

The notion of genomic style has been around since the "genome hypothesis" of
Grantham [8,9], who first recognized the idiosyncratic nature of codon usage.
Later, Karlin used the bias in dinucleotide usage as the "genomic signature"
of a species [19]. Forsdyke suggests that the species "broadcast" their genes in
different g + c frequencies [6], and that this could be crucial for speciation; in
this way, genomic styles could be the first line of an immune system^2. There
have been other proposals, usually for phylogenetic purposes. The reasons for
the existence of different styles are debatable: for instance, changes in the
molecular machinery, tRNA abundance, environmental temperature, different
biases in the mutation rates, the requirements of messages other than the
protein sequences [35], etc. The exact causal relations are subject to discussion.

In order to improve the profile of genomic styles, we want to choose those
statistics which: (1) have typical and statistically sound values for each species,
with small variability, (2) have different values in different species, and (3) do
not depend (exclusively) on the amino acids encoded by a sequence (i.e., they
do depend on backtranslation). Because of space limitations, we will not give
the values of all computations; in the graphics, not all the species will be
displayed when this is not required. Moreover, we will omit the data in the
case of well known facts. All computations and data sets can be found at [1].

3.1 Nucleotide frequencies

The most natural computation is the frequency of the four nucleotides in the
sequences, as well as their frequencies in the different codon positions. For each
sequence $w \in \mathcal{B}^{3*}$, $w = w_0, \ldots, w_{3N-1}$, and each nucleotide $a$,
we compute

$$p_a(w) = \frac{1}{3N} \sum_{i=0}^{3N-1} \delta_a(w_i), \qquad p^j_a(w) = \frac{1}{N} \sum_{i=0}^{N-1} \delta_a(w_{3i+j-1}), \quad j = 1, 2, 3,$$

where $\delta_a(x)$ is 1 if $x = a$ and 0 otherwise. Our computations confirm a
number of facts already known in the literature, like "Chargaff's second law",
which states that $p_a \approx p_t$ and $p_c \approx p_g$, as can be observed in
Graphic 1a. Since, in addition, $\sum_{a \in \mathcal{B}} p_a = 1$, Chargaff's law
implies positive correlation between complementary nucleotides (a with t, and c
with g) and negative correlation between non-complementary ones. Thus we can
reduce the study to a single value; the usual choice is $p_{g+c} = p_c + p_g$. It
is well known that $p_{g+c}$ has different values in different species, and that
all the genes in a species have similar values; this can be seen in Graphic 1b,
with histograms showing the number of sequences of each species in different
$p_{g+c}$ ranges. Some qualifications are due: first, it is also known that
eukaryotic genomes are organized in large "islands" called isochores [24], with
different $p_{g+c}$ values but each of them relatively homogeneous. Moreover, in a
set of closely related species $p_{g+c}$ may depend more on the genes than on the
species [23]. However, the general pattern holds, and it is used both for the
detection of genes (since genes tend to be $p_{g+c}$-richer than non-coding
regions) and in the detection of horizontally transferred genes (see section 5).

^2 Indeed, [5] shows that some viruses may mimic the genomic style of their host,
in order to be expressed.




Fig. 1. (a) Nucleotide frequencies. (b) Histograms for $p_{g+c}$. (c) $p_{g+c}$ in
different codon positions.



Graphic 1c shows the values of $p^i_{g+c} = p^i_g + p^i_c$ for the different
species, together with $p_{g+c}$. We notice the existence of wide variations in
the $p_{g+c}$ composition depending on the codon position. In addition, extreme
values of $p_{g+c}$ are usually supported by extreme values of $p^3_{g+c}$; this
shows that the sequences were adapted to get a certain $p_{g+c}$ level, and that
the third (usually synonymous) codon position was used for this purpose. As can
be seen in Table 2, $p^1_{g+c}$ and $p^2_{g+c}$ are almost entirely determined by
the encoded amino acids.
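The frequencies used in this subsection are simple to compute; a minimal sketch follows, in which the function names are illustrative and the sequence is assumed to be a lowercase string over a, c, g, t whose length is a multiple of 3.

```python
def nucleotide_freq(w, a):
    """Frequency of nucleotide a in the whole sequence w."""
    return w.count(a) / len(w)

def positional_freq(w, a, j):
    """Frequency of nucleotide a among the letters in codon position j (1, 2 or 3)."""
    letters = w[j - 1::3]
    return letters.count(a) / len(letters)

def gc_content(w, j=None):
    """g+c frequency of w, overall (j=None) or restricted to codon position j."""
    if j is None:
        return nucleotide_freq(w, "g") + nucleotide_freq(w, "c")
    return positional_freq(w, "g", j) + positional_freq(w, "c", j)
```

For the toy sequence `"atggcc"`, `gc_content(w)` is 4/6, while `gc_content(w, 3)` is 1.0: both third positions hold g or c, which is the kind of position-specific skew discussed above.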
3.2 Codon usage



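The frequency of a given codon $C = c_0, c_1, c_2 \in \mathcal{B}^3$ in a sequence
$w = w_0, \ldots, w_{3N-1} \in \mathcal{B}^{3*}$ is defined as
$p_C = \frac{1}{N} \sum_{i=0}^{N-1} \delta_{c_0}(w_{3i})\, \delta_{c_1}(w_{3i+1})\, \delta_{c_2}(w_{3i+2})$.
For each codon $C \in \mathcal{B}^3$, we define its synonymous class
$\theta(C) = \{C' \in \mathcal{B}^3 : \tau(C) = \tau(C')\}$. Then the synonymous codon
usage and the relative synonymous codon usage [29] of $C$ are defined as

$$SCU_C = \frac{p_C}{\sum_{C' \in \theta(C)} p_{C'}}, \qquad RSCU_C = \frac{|\theta(C)|\, p_C}{\sum_{C' \in \theta(C)} p_{C'}} = |\theta(C)|\, SCU_C.$$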



As we mentioned above, the codon choice pattern was noted very early to 
be a signature of the species, and our data confirm this. We will dispense 
with extensive SCU tables, since they are well known in the literature, and 
available in public databases [26]. As we said before, the common approach to 
backtranslation uses SCU as the probability of choosing a certain codon, given 
the amino acid. RSCU is used for comparisons between codons from different 
synonymous classes. 
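As an illustration of these definitions, the following sketch computes SCU and RSCU for a single coding sequence. The function names are chosen here for illustration; codons of amino acids that do not occur in the sequence are simply left out of the result, which is an assumption about how to treat the undefined 0/0 case.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in t, c, a, g order ("*" marks stop).
BASES = "tcag"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def codon_counts(w):
    """Count the codons of w, read in frame from position 0."""
    return Counter(w[i:i + 3] for i in range(0, len(w) - len(w) % 3, 3))

def scu_rscu(w):
    """Return two dicts (SCU, RSCU) indexed by codon."""
    counts = codon_counts(w)
    scu, rscu = {}, {}
    for codon, aa in CODE.items():
        family = [c for c, a in CODE.items() if a == aa]  # synonymous class
        total = sum(counts[c] for c in family)
        if total == 0:
            continue  # amino acid absent from w: SCU undefined, skipped
        scu[codon] = counts[codon] / total
        rscu[codon] = len(family) * scu[codon]
    return scu, rscu
```

For `w = "aaaaaaaag"` (codons aaa, aaa, aag, all coding for K), SCU is 2/3 for aaa and 1/3 for aag, and the corresponding RSCU values are 4/3 and 2/3; an RSCU of 1 would mean no preference within the class.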



3.3 Dinucleotides 



Most published results on dinucleotide frequencies consider long DNA
sequences, including both coding and non-coding regions [4,12,30]. Our own
computations, in spite of being limited to coding sequences, confirm most of
the facts already noted by the different authors. This accounts for the fact
that dinucleotide frequencies are not considered as "coding statistics": their
behavior is similar in coding and in non-coding sequences. However, they do
exhibit characteristic patterns according to the different species and groups.
Karlin [19] even used them to define the genome signature of a species as the
collection $\{\rho^*_{\alpha\beta}\}$, with $\alpha$ and $\beta$ ranging over
$\mathcal{B}$. Here $\rho_{\alpha\beta} = p_{\alpha\beta} / (p_\alpha p_\beta)$ (with
$p_{\alpha\beta}$ being the frequency of the dinucleotide $\alpha\beta$) and
$\rho^*$ is the computation of $\rho$ over the sequence concatenated to its
reverse complement (in order to get the information about both DNA strands).

IDH. There is an interesting set of indices which can be computed from
dinucleotide frequencies. The so-called index of DNA homogeneity (IDH) was
proposed by Miramontes et al. [25] and is defined for a binary sequence as

$$d = \frac{p_{00}\, p_{11} - p_{01}\, p_{10}}{p_0\, p_1}.$$

We define $d_{ry}(w) = d(\pi_{ry}(w))$, $d_{ws}(w) = d(\pi_{ws}(w))$, and
$d_{mk}(w) = d(\pi_{mk}(w))$. This index expresses the degree of local homogeneity
of the sequence: long stretches of 0 or 1 will cause $d$ to be near 1, while strong
alternation will push it toward -1. The three indices $d_{ry}$, $d_{ws}$ and
$d_{mk}$ are not independent, and since $\pi_{mk}$ is the least meaningful of the
binary projections, the choice in [25] was to plot the species in the
$(d_{ry}, d_{ws})$ plane. The corresponding map with our own data is in Graphic 2a.
Graphic 2b displays the distribution of the values in the sequences of some
species. Both the specificity and the classificatory power of IDH can be clearly
noted.
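A minimal sketch of the index follows. It assumes the normalization by $p_0 p_1$ (the product of the single-symbol frequencies), which is the choice that makes $d$ approach 1 for long homogeneous stretches and -1 for strict alternation; the function and dictionary names are illustrative, and the input is assumed to contain both symbols.

```python
# Binary projections (0/1 labels are a convention assumed here).
PI_RY = {"a": 0, "g": 0, "c": 1, "t": 1}  # purine / pyrimidine
PI_WS = {"a": 0, "t": 0, "c": 1, "g": 1}  # weak / strong

def idh(bits):
    """Index of DNA homogeneity of a binary sequence (list of 0s and 1s)."""
    pairs = list(zip(bits, bits[1:]))  # overlapping dinucleotides
    p = {(x, y): pairs.count((x, y)) / len(pairs) for x in (0, 1) for y in (0, 1)}
    p0 = bits.count(0) / len(bits)
    p1 = 1.0 - p0
    return (p[0, 0] * p[1, 1] - p[0, 1] * p[1, 0]) / (p0 * p1)

def d_ry(w):
    """IDH of the purine/pyrimidine projection of the nucleotide string w."""
    return idh([PI_RY[x] for x in w])

def d_ws(w):
    """IDH of the weak/strong projection of the nucleotide string w."""
    return idh([PI_WS[x] for x in w])
```

A strictly alternating sequence such as `[0, 1] * 50` gives a value very close to -1, while two long blocks such as `[0] * 50 + [1] * 50` give a value close to 1.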




Fig. 2. (a) Position of species in the $(d_{ry}, d_{ws})$ plane. (b) Histograms for
IDH in some species.



3.4 Fourier harmonics and Periodicities 



Another common tool for DNA analysis is the discrete Fourier transform [22].
For a binary sequence $w = w_0, \ldots, w_{N-1}$, we define the spectrum and its
$m$-smoothed version:

$$S_n(w) = \left| \sum_{k=0}^{N-1} w_k\, e^{2 \pi i n k / N} \right|^2, \qquad \bar{S}^m_n(w) = \frac{1}{2m+1} \sum_{k=n-m}^{n+m} S_k(w).$$

$S_n(w)$ measures the content of 'frequency' $n$, which corresponds to
a period $N/n$; the smoothed value helps to remove the dispersion that appears
for small data sets.
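The following sketch computes a spectrum and its smoothed version with a direct (quadratic time) discrete Fourier transform, to keep it free of external libraries; an FFT would be used in practice. Two choices here are assumptions: the spectrum is taken as the squared modulus of the transform, and the smoothing window wraps around at the ends of the spectrum.

```python
import cmath

def spectrum(bits):
    """S_n = |sum_k w_k exp(2*pi*i*n*k/N)|^2 for n = 0..N-1 (direct DFT)."""
    N = len(bits)
    return [
        abs(sum(b * cmath.exp(2j * cmath.pi * n * k / N) for k, b in enumerate(bits))) ** 2
        for n in range(N)
    ]

def smooth(S, m):
    """Average of S over the window n-m..n+m, with indices taken modulo len(S)."""
    N = len(S)
    return [sum(S[(n + k) % N] for k in range(-m, m + 1)) / (2 * m + 1) for n in range(N)]
```

For the periodic toy sequence `[1, 0, 0] * 16` (so N = 48), the only non-zero value among the frequencies 1 to 24 is at n = 16, that is, at period 48/16 = 3.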

The main and best known periodicity in DNA sequences is of period 3; it
can be explained by the asymmetry in the codon positions [14,21], though its 
presence in tRNA genes suggests some other origin. Another well documented 
periodicity is of period 10.5 ± 0.5; it has been attributed to requirements from 
the structure of both DNA and proteins, and the exact contribution of each 
is unclear. Some periodicities of higher periods have been shown, but they are 
not statistically significant for the typical lengths of genes. 

We divided each sequence in non-overlapping windows of length 256, and
used the fast Fourier transform (FFT) algorithm to compute
$\bar{S}^m \circ \pi_{ry}$, $\bar{S}^m \circ \pi_{ws}$ and $\bar{S}^m \circ \pi_{mk}$
for all the species. The results were averaged and are shown in
Graphics 3a, 3b and 3c for some of the species; only part of the ordinate axis is
used, in order to highlight their differences. The two periodicities mentioned
before are present: there is a big peak at $n = 85$ for the three projections
in almost all the species (the top of the peaks is outside the graphics); this
corresponds to a period of $256/85 \approx 3$. There is also a minor peak around
$n = 24$, present for most species and for most projections, corresponding to a
period of about 10.5 ($256/24 \approx 10.7$); there are some differences between
species, a fact that has been observed before and is related to the various
origins of this periodicity.

Fig. 3. $\bar{S}^m$ for (a) $\pi_{ws}$, (b) $\pi_{ry}$ and (c) $\pi_{mk}$.

To show the specificity of the spectrum, we chose a set of 20 collections of
sequences, each set selected at random to be 1% of E5. We computed the
average of spectra for each set; the results for $\pi_{ws}$ are shown in Graphic 4a.

Position dependent spectra. To take into account the asymmetry of the
different codon positions, we computed the spectra for the three subsequences
$w^{(i)}_n = w_{3n+i}$, $i = 0, 1, 2$, using windows of length 64 (data not shown).
In the absence of period 3, the most noticeable feature is a peak at $n = 18$,
corresponding to a period $\approx 3.5$ in the subsequence, and hence
$\approx 10.5$ in the sequences; it is by far stronger for the middle codon
position, a fact that hints at a dependence on the amino acid sequence.




Fig. 4. Dispersion of (a) $\bar{S}^m$ and (b) $\bar{\Gamma}$ for $\pi_{ws}$ in E5.
3.5 Autocorrelation functions

Correlation functions [13,15] measure the excess or defect of nucleotides at
different distances; if $p_{\alpha,\beta}(d)$ is the frequency with which we find
$\beta$ $d$ positions after $\alpha$, then what we compute is
$p_{\alpha,\beta}(d) - p_\alpha p_\beta$. More precisely, what we compute for a
sequence $w = w_0, \ldots, w_{N-1}$ is

$$\Gamma_{\alpha,\beta}(d)[w] = \frac{1}{N-d} \sum_{i=0}^{N-d-1} \delta_\alpha(w_i)\, \delta_\beta(w_{i+d}) - p_\alpha(w)\, p_\beta(w).$$

We computed $\Gamma_{0,0}$ for $\pi_{ry}$, $\pi_{ws}$, $\pi_{mk}$. The most
conspicuous result of this computation is the strong oscillation due to period 3;
this can be removed by considering the smoothed version,
$\bar{\Gamma}_{\alpha\beta}(d) = \frac{1}{3} \sum_{i=d-1}^{d+1} \Gamma_{\alpha\beta}(i)$;
when this was done, the periodicity of period 10.5 could also be seen. To give an
idea of the shape of the curves, and to show their specificity, Graphic 4b shows
the results for $\pi_{ws}$, for B1, E5, and the same subsets of E5 used in Graphic
4a. In general, $\bar{\Gamma}$ behaves very similarly to the Fourier transform, in
specificity and in the dependencies on alphabet and/or codon position. This is not
surprising, since both express the same information (if $\Gamma$ is computed for a
circular sequence, then it can be recovered from the spectra, and vice versa, by
the Wiener-Khinchin theorem). Position dependent autocorrelation functions were
also computed, with no unexpected results.
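A minimal sketch of the autocorrelation and its smoothed version is given below. It assumes that the pair frequency is normalized by the number of available pairs, N - d, and that the smoothing is a three-point average centered at d; the function names are illustrative.

```python
def autocorr(bits, d, a=0, b=0):
    """Excess frequency of finding b exactly d positions after a in a binary sequence."""
    N = len(bits)
    joint = sum(1 for i in range(N - d) if bits[i] == a and bits[i + d] == b) / (N - d)
    return joint - (bits.count(a) / N) * (bits.count(b) / N)

def smoothed_autocorr(bits, d, a=0, b=0):
    """Three-point average over distances d-1, d, d+1 (removes the period 3 oscillation)."""
    return sum(autocorr(bits, i, a, b) for i in (d - 1, d, d + 1)) / 3
```

For the period 3 toy sequence `[0, 1, 1] * 20`, `autocorr(bits, 3)` equals 1/3 - 1/9 = 2/9, whereas distances 2 and 4 give -1/9 each, so `smoothed_autocorr(bits, 3)` vanishes: the averaging cancels the oscillation of period 3.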



4 Backtranslation strategy 

4.1 Genomic style beyond codon usage

We will consider all of the coding statistics reviewed in the previous section as 
features defining the genomic style of a species. It is important to notice that 
they are not (or not directly) dependent on the codon usage; if this were the 
case, then genomic style would reduce to RSCU, and the current approach to 
backtranslation would be already optimal. 

It is clear that $p_a$ and $p^j_a$ are recovered by RSCU, if the amino acid
composition is kept constant (this is the case in $\beta^{cu}_{B1}(B1)$ and
$\beta^{cu}_{E5}(E5)$); in general, since amino acid composition is rather similar
in all the different species (data not shown), we can expect nucleotide
frequencies to be conserved.

For dinucleotides, this is not so clear, even if the amino acid frequencies are
kept: in spite of recovering the number of dinucleotides starting at the first and
second codon positions, RSCU will not recover those starting at the third. This
is important, since most of the degeneracy is in this position, and "genomic
style" depends strongly on it; moreover, mutation rates tend to be affected
by the neighboring nucleotides [2,16], in ways that are species-dependent. In
particular, Miramontes et al. [25] show that their indices (IDH) are not
determined by codon usage, even when the amino acid frequency was conserved.
Our data (not shown) confirm it.

As for the Fourier spectra, Guigo [10,11] shows that it is rather independent
from $p_{g+c}$. To discard dependence on RSCU, we computed the spectra on
$\beta^{cu}_{B1}(B1)$, $\beta^{cu}_{B1}(E5)$, $\beta^{cu}_{E5}(E5)$ and
$\beta^{cu}_{E5}(B1)$; results for $\bar{S}^m \circ \pi_{ws}$ are displayed in
Graphic 5a. We can see that all the sets of guesses lie between the real spectra,
with codon usage being a bit more relevant than the amino acid sequences (the
species); this was also the case for $\pi_{ry}$ and $\pi_{mk}$ (data not shown).
Although the autocorrelation function contains the same information as the
spectrum, the details of each one are the main lines of the other, and thus, each
may be considered apart. Graphic 5b displays computations of
$\bar{\Gamma}_{0,0} \circ \pi_{ws}$ over the same sets; it can be noticed that in
this case the species (amino acid sequences) are the major contribution, with only
a small effect of RSCU.




Fig. 5. (a) $\bar{S}^m$ and (b) $\bar{\Gamma}$ for the $\pi_{ws}$ projection of B1,
E5, $\beta^{cu}_{B1}(B1)$, $\beta^{cu}_{B1}(E5)$, $\beta^{cu}_{E5}(B1)$ and
$\beta^{cu}_{E5}(E5)$.



4.2 Genetic Algorithms for Backtranslation

We want to obtain a backtranslation that imitates the genomic style of a target
species as closely as possible; thus, we will look for a backtranslation for which
the coding statistics listed above are close to those of the target species, i.e.,
their distance is minimum. We choose, for $w \in \mathcal{B}^{3*}$,

$$f_1(w) = |p_{g+c}(w) - p^*_{g+c}|, \qquad f_2(w) = \sum_{C \in \mathcal{B}^3} |RSCU_C(w) - RSCU^*_C|,$$

$$f_3(w) = |d_{ry}(w) - d^*_{ry}| + |d_{ws}(w) - d^*_{ws}| + |d_{mk}(w) - d^*_{mk}|,$$

$$f_4(w) = \sum_{k} a_k\, |\bar{S}_k(w) - \bar{S}^*_k|, \qquad f_5(w) = \sum_{k \ge 2} b_k\, |\bar{\Gamma}(k)[w] - \bar{\Gamma}^*(k)|,$$

where the values with "*" are obtained averaging over the known coding
sequences of the target species, and $a_k$ and $b_k$ are weights, incorporated in
order to give more importance to some parts of the curves, e.g. to encourage a
uniform convergence. The indices in the sums of $\bar{S}$ and $\bar{\Gamma}$ follow
our particular choices of window lengths 256 and 30, respectively.



With these definitions, what we want, for a given $u \in \mathcal{A}^*$ and a given
target species, is to minimize $f(w)$, with $w \in \tau^{-1}(u)$. There are two main
difficulties involved. First, we have a non-convex problem, in a vast search space,
with terms depending on several scales of the sequences. Moreover, it is a problem
of multiobjective optimization. For these reasons, we propose the use of genetic
algorithms [17] (GA), especially suited for problems with these characteristics.
Our particular implementation of a genetic algorithm for backtranslation
follows here.



• for $1 \le i \le n$, initialize $w^i = \beta^{cu}(u)$

• while not stop condition

    ■ for $1 \le j \le 5$, $f_j^{max} \leftarrow \max_i f_j(w^i)$

    ■ for $1 \le i \le n$, $1 \le j \le 5$, compute $N^i_j$, the expected number of
      copies of $w^i$ according to $f_j$

    ■ for $1 \le i \le n$, $N^i \leftarrow \sum_j \lambda_j N^i_j$

    ■ Update $P$ using $\{N^i\}$ [stochastic universal sampling]

    ■ Apply genetic operators: crossover and mutation



For a given $u \in \mathcal{A}^*$, we iterate on a population $P$ of $n$ guesses for
$u$, denoted by $\{w^i\}$. As seen in the scheme, our initial condition is the usual
backtranslation (imitation of RSCU); the GA is then iterated to optimize coding
statistics. $N^i_j$ are the expected number of copies of a guess in the next
generation; weighting them with $\lambda_j$ we combine the different objective
functions, without needing to make their numeric values comparable. The genetic
operators used are crossover and mutation, both adapted to maintain the encoded
amino acid sequence $u$. In addition, the probability of crossover between two
guesses $w^i$ and $w^k$ depends on the Hamming distance between them, making
crossing between distant guesses less probable (this is introduced in order to
encourage the exploration of a bigger region in search space).
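The general shape of such an algorithm can be sketched as follows. This is a deliberately reduced illustration and not the implementation described in this article: it optimizes a single objective (the distance to a target g+c level), starts from uniformly chosen synonymous codons instead of the canonical backtranslation, uses plain one-point crossover at codon boundaries, and omits the Hamming-distance rule. All names and default parameter values are assumptions made for the example, and the protein is assumed to have at least two residues.

```python
import random
from itertools import product

BASES = "tcag"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
SYN = {}  # amino acid -> list of its synonymous codons
for codon, aa in CODE.items():
    SYN.setdefault(aa, []).append(codon)

def gc(codons):
    """g+c frequency of a guess stored as a list of codons."""
    s = "".join(codons)
    return (s.count("g") + s.count("c")) / len(s)

def mutate(codons, rate, rng):
    """Synonymous mutation: each codon may be redrawn within its synonymous class."""
    return [rng.choice(SYN[CODE[c]]) if rng.random() < rate else c for c in codons]

def crossover(x, y, rng):
    """One-point crossover at a codon boundary; the encoded protein is preserved."""
    cut = rng.randrange(1, len(x))
    return x[:cut] + y[cut:]

def sus(population, expected, rng):
    """Stochastic universal sampling with equally spaced pointers."""
    n = len(population)
    step = sum(expected) / n
    start = rng.uniform(0, step)
    chosen, cum, i = [], expected[0], 0
    for k in range(n):
        pointer = start + k * step
        while cum < pointer and i < n - 1:
            i += 1
            cum += expected[i]
        chosen.append(population[i])
    return chosen

def ga_backtranslate(protein, target_gc, n=30, generations=60, rate=0.05, seed=0):
    """Return a coding sequence for protein whose g+c level approaches target_gc."""
    rng = random.Random(seed)
    error = lambda w: abs(gc(w) - target_gc)
    pop = [[rng.choice(SYN[aa]) for aa in protein] for _ in range(n)]
    best = min(pop, key=error)
    for _ in range(generations):
        errors = [error(w) for w in pop]
        worst = max(errors)
        # Expected copies grow as the error shrinks; the floor keeps the total positive.
        expected = [worst - e + 1e-9 for e in errors]
        parents = sus(pop, expected, rng)
        pop = [mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rate, rng)
               for _ in range(n)]
        best = min(pop + [best], key=error)
    return "".join(best)
```

Because mutation and crossover only ever exchange synonymous codons, every individual in every generation translates back to the input protein, which is the property the operators in the text are designed to keep.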



A special feature of this approach is the use of the candidate solutions
(guesses) as their own encodings for the GA. Of course, this is made possible 
by the sequential and digital nature of genetic sequences, which were the very 
inspiration of GA and other forms of evolutionary computation. Obvious as 
it may seem, this is the only application we know about in which genetic 
algorithms are applied to genetic sequences. 
4.3 Results of GA application



The genetic algorithm was run several times for randomly selected sequences
of B1 and E5 (with the other species as target, in each case), in order to find
the best values for its parameters (mutation and crossover rates, population
size, etc.), for the weights, etc.; this was done first for each $f_j$, and then
for the combined optimization (detailed data can be found at [1]). Even when
a single function was optimized, we computed all the statistics on the resulting
guesses, in order to see the effect of each statistic on the rest. Optimization of
spectra and autocorrelation functions, for instance, do not have the same effect
on the sequence, in spite of working with the same information. Optimization
of $\bar{S}$ causes strong oscillations in $\bar{\Gamma}$, whereas optimization of
$\bar{\Gamma}$ alone tends to cause a flattening of $\bar{S}$. In general, imitation
of $\bar{\Gamma}$ is the most difficult, followed by $\bar{S}$, with $p_{g+c}$, RSCU
and especially IDH being the easiest. The joint optimization of the $f_j$ arrived
at values of each $f_j$ only slightly worse than those obtained in single function
optimization, with the exception of $f_4$, which was actually better. Optimization
of $p_{g+c}$ and RSCU appeared to be almost unnecessary: when only $f_3$, $f_4$ and
$f_5$ were considered (with $\beta^{cu}$ as initial condition), the final $p_{g+c}$
and RSCU were still closer to the target species than the original sequence was
to its own. In general, all $f_j$ are optimized by the genetic algorithm; it is
even possible to make the periodicity of period 10.5 appear in sequences from
which it was absent.



Fig. 6. $\bar{S}^m$ for the $\pi_{ry}$ projection of B1, $W^E$, $\beta^{cu}_{B1}(W^E)$, $\beta^*_{B1}(W^E)$ and $W^B$.

To remove the differences due to the amino acid sequences (which can strongly
influence any coding statistic in a sample with just a few sequences), we
constructed a test set with sequences encoding homologous proteins in B1 and E5.
To do this, we extracted from the euGenes database [7] the list of homologies
between these species, chose the cases with a higher identity percentage, and
cut the segment of each sequence corresponding to the alignment. Thus we
obtained a set $W^B = \{w^B_1, \ldots, w^B_{20}\}$ of sequences from B1, and another
set $W^E = \{w^E_1, \ldots, w^E_{20}\}$ from E5, with each pair $w^B_i$, $w^E_i$
encoding very similar amino acid sequences. We performed a canonical
backtranslation on $\tau(W^E)$, obtaining $\beta^{cu}_{B1}(W^E)$; we also performed
a backtranslation by means of our genetic algorithm, obtaining what we will call
$\beta^*_{B1}(W^E)$. The computation of the diverse coding statistics allows us to
see how this procedure gets the backtranslation closer to the average style of
B1; moreover, since we do have $W^B$, we can compare with the values of that
particular set of B1. For instance, for IDH, we can compute a distance between two
sets of sequences $S_1$ and $S_2$ as

$$d_{idh}(S_1, S_2) = |d_{ry}(S_1) - d_{ry}(S_2)| + |d_{ws}(S_1) - d_{ws}(S_2)| + |d_{mk}(S_1) - d_{mk}(S_2)|.$$

We obtain that $d_{idh}(W^E, W^B) = 0.275$, while
$d_{idh}(\beta^{cu}_{B1}(W^E), W^B) = 0.104$, and
$d_{idh}(\beta^*_{B1}(W^E), W^B) = 0.049$. Something similar happens with the other
statistics. Graphic 6 shows the graphs of $\bar{S}^m \circ \pi_{ry}$ for the
different sets; we can see again how $\beta^*$ builds a preimage for the image of
$W^E$ (which is a typical E5 subset) which is far more similar to B1 and $W^B$ than
the one built by the usual backtranslation procedure, $\beta^{cu}$. For
$\bar{\Gamma}$ the results are similar, but not so easy to observe in the graphics;
instead, Table 3 displays the average difference between the curve
$\bar{\Gamma}(W^B)$ and those for $W^E$, $\beta^{cu}_{B1}(W^E)$ and
$\beta^*_{B1}(W^E)$. Again, $\beta^*$ improves with respect to $\beta^{cu}$.



Table 3: Average distance of curves $\bar{\Gamma}$

  Projection    $W^E$     $\beta^{cu}_{B1}(W^E)$    $\beta^*_{B1}(W^E)$
  ...           0.0018    0.0013                    0.0008
  ...           0.0016    0.0019                    0.0011



5 Discussion 

The purpose of this article is to propose an improvement of the current
procedures of protein backtranslation, through the inclusion of coding statistics
other than RSCU which contribute to characterize the different genomes; this
can be accomplished by the use of genetic algorithms. We first presented several
known coding statistics, showing their idiosyncratic nature. Then we proposed a
particular implementation of genetic algorithms, for a small set of coding
statistics; this is only an example, since other choices of the statistics, or
other implementations of evolutionary computation, may give better results. Our
implementation, which is available at [1], does already produce backtranslations
which mimic the coding statistics of the target species, in ways that are not
automatically reproduced by RSCU imitation.

The definitive test for our approach would be the use of our procedure for
the in vitro generation of actual artificial genes: we expect it to have a higher
success frequency than the canonical backtranslation. Meanwhile, the in silico
experiment consisting of the backtranslation of a human protein into "bacterial"
style, and the comparison of the statistics of the resulting gene to those
of a homologous bacterial gene (see section 4.3), suggest that our approach
is correct. In fact, the "optimized" preimages had more exact matches with
the bacterial genes (at the aligned codon positions) than the simple RSCU-based
backtranslation; this happened when human proteins were optimized for
"bacterial style", and also when bacterial proteins were translated into
"human". Though small, the systematic increase in exact matches is surprising:
we did not expect the imitation of coding statistics to have this effect, since
the number of preimages satisfying a given profile is still huge.

This increase in exact matches suggests that the algorithm could also be
applied to the problem of "gene fishing" through PCR reactions primed by
degenerate primers, or "guessmers". This is a particular case of backtranslation,
limited to short sequences selected for their minimal ambiguity. Thus, coding
statistics are hard to evaluate (sequences are short) and hard to optimize
(sequences are rigid). In spite of these difficulties, preliminary in silico
experiments seem to support this application.

Another field of application for the ideas presented here is the analysis of 
sequences: discussions on the relations and origins of coding statistics can be 
illuminated by massive backtranslation of sequences under some criteria, like 
we did in 4.1 with RSCU to study its relation to spectra and autocorrelation 
functions. Of special interest are the comparisons between genes suspected, or 
known, to be related by horizontal transfer [34]. Values of RSCU and/or $p_{g+c}$
divergent from the style of a genome have been used to detect horizontally 
transferred genes; the degree of their divergence has been used as a clock 
to determine when a gene was acquired[33]. Some authors[20] have done this 
through a "reverse amelioration" which is a kind of backtranslation, and could 
be enriched by the results and procedures given here. 